Skip to content

feat(eval): episode sharding, parallel launcher, and autotune#3275

Closed
pkooij wants to merge 2 commits intofeat/async-vector-envfrom
feat/eval-parallel
Closed

feat(eval): episode sharding, parallel launcher, and autotune#3275
pkooij wants to merge 2 commits intofeat/async-vector-envfrom
feat/eval-parallel

Conversation

@pkooij
Copy link
Copy Markdown
Member

@pkooij pkooij commented Apr 3, 2026

Title

feat(eval): episode sharding, parallel launcher, and autotune

Type / Scope

  • Type: Performance / Feature
  • Scope: lerobot/scripts/lerobot_eval.py, lerobot/configs/default.py, new lerobot_eval_parallel.py, new lerobot_eval_autotune.py

Summary / Motivation

Even after PR #3274 fixes AsyncVectorEnv, a single eval process achieves only ~20% GPU utilisation (env step ~20ms >> inference ~5ms). The remaining idle time can be recovered by running multiple independent eval processes (shards), each handling a disjoint slice of episodes and its own model copy. On an H100 (80 GB), SmolVLA at fp16 (~14 GB) fits 4–5 times → 4 × 20% ≈ 80–100% GPU utilisation with zero networking or coordination overhead.

This PR adds:

  1. Episode sharding in lerobot_eval.py: each process handles episodes shard_id, shard_id+N, ... with non-overlapping seeds.
  2. lerobot-eval-parallel: spawns K subprocesses, sets MUJOCO_GL and OMP_NUM_THREADS, merges results.
  3. lerobot-eval-autotune: probes GPU VRAM, CPU cores, model footprint, and env step time; outputs optimal num_shards / batch_size / MUJOCO_GL with a paste-ready command.

Related issues

What changed

  • configs/default.py (EvalConfig): add shard_id: int = 0, num_shards: int = 1; validate ranges in __post_init__
  • lerobot_eval.py: add _shard_episodes(n_episodes, shard_id, num_shards) → list[int]; eval_main computes per-shard episode count and seed offset; writes shard_K_of_N.json when num_shards > 1, else eval_info.json (default unchanged)
  • lerobot_eval_parallel.py (new, ~120 LOC): parse --num-shards / --render-device; spawn K subprocesses; wait; merge shard JSON files into eval_info.json
  • lerobot_eval_autotune.py (new, ~140 LOC): 8-step hardware probe → AutotuneRecommendation; main() prints summary + paste-ready command
  • pyproject.toml: register lerobot-eval-parallel and lerobot-eval-autotune entry points

Default behaviour is unchanged: num_shards=1 → exactly the same execution path as before.

How was this tested (or how to run locally)

Tests added:

  • test_shard_assignment: _shard_episodes(100, 2, 5) == [2, 7, 12, ..., 97]
  • test_shard_uneven: 103 episodes / 5 shards distributes without overlap or gap
  • test_shard_no_overlap: union of all shards == full episode range

Single-machine parallel run:

# Auto-detect optimal config
lerobot-eval-autotune policy.path=lerobot/smolvla_libero env.type=libero

# Run with 4 shards
lerobot-eval-parallel --num-shards 4 \
  policy.path=lerobot/smolvla_libero \
  env.type=libero \
  eval.n_episodes=200 \
  eval.batch_size=20 \
  output_dir=outputs/eval/parallel_run

# Let autotune decide
lerobot-eval-parallel --num-shards auto \
  policy.path=lerobot/smolvla_libero \
  env.type=libero

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

Reviewer notes

  • subprocess.Popen (fork+exec) gives each shard a clean Python interpreter and its own valid EGL/osmesa context — no stale GPU handles inherited from the parent.
  • Seeds are non-overlapping: shard K starts at seed + K * ceil(n_episodes / num_shards), so the combined run is equivalent to one serial run with the same seeds.
  • --render-device auto: uses EGL (GPU) for 1 shard; switches to osmesa (CPU rendering, 0 VRAM) when multiple model copies would exhaust VRAM.
  • Anyone in the community is free to review the PR.

pkooij and others added 2 commits April 7, 2026 13:43
Add lerobot-eval-parallel and lerobot-eval-autotune entry points for
multi-process evaluation. A single H100 running 4 shards of SmolVLA
achieves ~100% GPU utilisation vs ~0.5% with the serial baseline.

- EvalConfig: add shard_id / num_shards fields; validate ranges
- lerobot_eval.py: _shard_episodes() splits n_episodes round-robin;
  eval_main uses per-shard n_episodes + seed offset; writes
  shard_K_of_N.json when num_shards > 1
- lerobot_eval_parallel.py: spawns K subprocesses with disjoint shard
  IDs, sets MUJOCO_GL and OMP_NUM_THREADS, merges results on completion
- lerobot_eval_autotune.py: probes GPU VRAM, CPU cores, optional model
  footprint and env step time; derives optimal num_shards / batch_size /
  MUJOCO_GL; prints a paste-ready command
- pyproject.toml: register lerobot-eval-parallel and lerobot-eval-autotune

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval_policy_all already supports running multiple task groups concurrently via
ThreadPoolExecutor, but policy.reset() was not thread-safe: all threads shared
the same policy object and its mutable state (action queues, temporal buffers).

Fix: each thread receives a shallow copy of the policy. copy.copy() creates a
new Python object whose _parameters dict is a shared reference — same tensor
storage, zero extra VRAM — while reset() rebinds per-episode state to fresh
objects per thread.

Caveat: ACT with temporal_ensemble_coeff is not safe with this approach (its
reset() mutates a shared sub-object). Keep max_parallel_tasks=1 for that config.

For MetaWorld (50 tasks, no temporal ensembling), max_parallel_tasks=4 raises
GPU utilization from ~20% to ~60-80% with no additional VRAM cost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pkooij pkooij force-pushed the feat/eval-parallel branch from b411838 to 66276f1 Compare April 7, 2026 11:44
@pkooij pkooij closed this Apr 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant